Hidden Markov Model

Essence

Model: States are hidden.

$$ \begin{aligned} \begin{matrix} \text{States}: & \boxed{x_1} & \boxed{x_2} & \cdots & \boxed{x_T} \\ & \downarrow & \downarrow & \cdots & \downarrow \\ \text{Observations}: & y_1 & y_2 & \cdots & y_T \\ \end{matrix} \end{aligned} $$

Goal: Determine the state at each moment, given the observations.

$$ \begin{aligned} \underline{x}^{*} &= \arg\max_{\underline{x}} P[\underline{x} | \underline{y}] \\ &= \arg\max_{\underline{x}} P[\underline{y}|\underline{x}] \cdot P[\underline{x}] \end{aligned} $$

Definition

$$ \lambda = (A, B, \pi) \text{ or } (N, M, A, B, \pi) $$

where $N$ is the number of hidden states $S_1, \dots, S_N$, $M$ is the number of observation symbols $V_1, \dots, V_M$, $A = \{a_{ij}\}$ is the state-transition matrix, $B = \{b_j(k)\}$ gives the observation (emission) probabilities, and $\pi = \{\pi_i\}$ is the initial state distribution.
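As a concrete representation, the parameters can be stored as plain arrays. The sketch below is only an assumed layout (row-stochastic numpy arrays named `A`, `B`, `pi`), reused by the later sketches in these notes:

```python
import numpy as np

# A minimal sketch of lambda = (N, M, A, B, pi), assuming the parameters are
# stored as row-stochastic numpy arrays (example sizes are arbitrary).
N, M = 3, 4
A = np.full((N, N), 1.0 / N)    # A[i, j] = a_ij = P[Q_{t+1} = S_j | Q_t = S_i]
B = np.full((N, M), 1.0 / M)    # B[j, k] = b_j(k) = P[O_t = V_k | Q_t = S_j]
pi = np.full(N, 1.0 / N)        # pi[i] = P[Q_1 = S_i]

# Stochastic constraints: each row is a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```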

Goal

Given a sequence of observations (each $O_t$ is one of the symbols $V_1, \dots, V_M$):

$$ O = \begin{matrix} O_1 & O_2 & \dots & O_T \end{matrix} $$

We want: $$ \arg\max_{\lambda} P[O | \lambda] $$

Problems:

Problem 1: A Hidden Model

Task: given the model $\lambda$, compute $P[O | \lambda]$, the probability of the observation sequence.

The Observation Sequence:

$$ \begin{aligned} O &= \begin{matrix} O_1 & O_2 &\cdots &O_T \end{matrix} \end{aligned} $$

The State Sequence: (The Hidden Model) $$ \begin{aligned} Q &= \begin{matrix} Q_1 & Q_2 &\cdots &Q_T \end{matrix} \end{aligned} $$

Solution 1

Since:

$$ \begin{aligned} &\begin{cases} P[Q|\lambda] = \pi_{Q_1} \cdot a_{Q_1 Q_2} \cdot a_{Q_2 Q_3} \cdots a_{Q_{T-1} Q_T} \\ P[O|Q, \lambda] = \prod_{t=1}^{T} P[O_t | Q_t, \lambda] = b_{Q_1}(O_1) \cdot b_{Q_2}(O_2) \cdots \cdot b_{Q_T}(O_T) \\ \end{cases} \end{aligned} $$

We have:

$$ \begin{aligned} P[O|\lambda] &= \sum_{Q} P[O, Q|\lambda] \\ &= \sum_{Q} P[O|Q, \lambda] \cdot P[Q|\lambda] \\ &= \sum_{Q_1 \cdots Q_T} [b_{Q_1}(O_1) \cdot b_{Q_2}(O_2) \cdots b_{Q_T}(O_T) ] \cdot [\pi_{Q_1} \cdot a_{Q_1 Q_2} \cdot a_{Q_2 Q_3} \cdots a_{Q_{T-1} Q_T}] \\ &= \sum_{Q_1 \cdots Q_T} [\pi_{Q_1} \cdot b_{Q_1}(O_1)] \cdot [a_{Q_1 Q_2} \cdot b_{Q_2}(O_2)] \cdots [a_{Q_{T-1} Q_T} \cdot b_{Q_T}(O_T)] \\ P[O|\lambda] &\sim [\times: (2T - 1) \cdot N^T ; +: N^T - 1] \end{aligned} $$

With even modest sizes $N = 5, T = 100$, this is on the order of $2T \cdot N^T = 200 \cdot 5^{100} \approx 10^{72}$ computations, so direct enumeration is impractical.
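For illustration, the sum above can be transcribed directly. The sketch below assumes the array layout from the earlier sketch and observations given as integer symbol indices; it is only feasible for tiny $N$ and $T$, and exists to make the $N^T$ blow-up explicit:

```python
from itertools import product

import numpy as np

def brute_force_likelihood(O, A, B, pi):
    """P[O | lambda] by summing over all N**T state sequences (exponential)."""
    N, T = A.shape[0], len(O)
    total = 0.0
    for Q in product(range(N), repeat=T):              # every state sequence
        p = pi[Q[0]] * B[Q[0], O[0]]                   # pi_{Q1} * b_{Q1}(O_1)
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]     # a_{Q_{t-1} Q_t} * b_{Q_t}(O_t)
        total += p
    return total
```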

Solution 2: Forward-Backward Procedure

Forward Phase

Definition: (Forward Variable)

$$ \begin{aligned} \alpha_{t}(i) &= P[O_1 O_2 \cdots O_t, Q_t = S_i | \lambda] \end{aligned} $$

Procedure:

  1. Initialization:

    $$ \begin{aligned} \alpha_1(i) &= \pi_i \cdot b_{i}(O_1) \\ \alpha_1(i) &\sim [\times: 1] \\ \alpha_1 &\sim [\times: N] \end{aligned} $$

  2. Induction:

    $$ \begin{aligned} \alpha_{t+1}(j) &= \left[ \sum_{i=1}^{N} \alpha_{t}(i) \cdot a_{ij} \right] \cdot b_{j}(O_{t+1}) \\ \alpha_{t+1}(j) &\sim [\times: (N+1) ; +: (N-1)] \\ \alpha_{t+1} &\sim [\times:N (N+1) ; +: N (N-1)] \end{aligned} $$

  3. Termination:

    $$ \begin{aligned} P[O|\lambda] &= \sum_{i=1}^{N} \alpha_{T}(i) \end{aligned} $$

Complexity:

$$ \begin{aligned} \alpha_{[1, T]} &\sim [\times: (T-1) \cdot N \cdot (N + 1) + N; +: (T - 1) \cdot N \cdot (N - 1)] \end{aligned} $$

For $ N = 5, T = 100 $ this is $(T-1) \cdot N \cdot (N+1) + N = 2975 \approx 3000$ multiplications, compared to the $10^{72}$ computations required previously.

The savings come from the lattice structure induced by the Markov property: each state depends only on the previous one (as in a Bayesian network), so partial sums over state paths can be reused from one time step to the next.
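A minimal sketch of the forward pass under the same assumed array layout (the code is 0-indexed, so row $t$ of `alpha` holds $\alpha_{t+1}(\cdot)$):

```python
import numpy as np

def forward(O, A, B, pi):
    """Forward pass: alpha[t, i] = P[O_1 .. O_{t+1}, Q_{t+1} = S_i | lambda] (0-indexed t)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                       # 1. initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]   # 2. induction
    return alpha, alpha[-1].sum()                    # 3. termination: P[O | lambda]
```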

Backward Phase

Definition: (Backward Variable)

$$ \begin{aligned} \beta_t(i) &= P[O_{t+1} O_{t+2} \cdots O_T | Q_t = S_i, \lambda] \end{aligned} $$

Procedure:

  1. Initialization: $$ \begin{aligned} \beta_{T}(i) &= 1 \end{aligned} $$

  2. Induction: $$ \begin{aligned} \beta_{t}(i) &= \sum_{j=1}^{N} \beta_{t+1}(j) \cdot b_j(O_{t+1}) \cdot a_{ij} \\ \beta_{t}(i) &\sim [\times:2N; +: (N - 1)] \\ \beta_{t} &\sim [\times: 2N^2; +: N(N-1)] \end{aligned} $$

Complexity:

$$ \begin{aligned} \beta_{[1, T]} \sim [\times: 2N^2T; + : TN(N - 1)] \end{aligned} $$
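A corresponding sketch of the backward pass, with the same assumptions:

```python
import numpy as np

def backward(O, A, B):
    """Backward pass: beta[t, i] = P[O_{t+2} .. O_T | Q_{t+1} = S_i, lambda] (0-indexed t)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                      # 1. initialization
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])    # 2. induction
    return beta
```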

Problem 2: The Training Criteria

Target: Given the observations, find the most likely state at each time $ t $. (A New Goal)

$$ \begin{aligned} Q_t &= \arg\max_{i \in [1, N]} P[Q_t = S_i | O, \lambda] \\ &= \arg\max_{i \in [1, N]} \gamma_{t}(i) \end{aligned} $$

Then:

$$ \begin{aligned} \gamma_{t}(i) &= P[Q_t = S_i | O, \lambda] = \frac{P[Q_t = S_i, O | \lambda]}{P[O | \lambda]} \\ &= \frac{P[Q_t = S_i, O_1 \cdots O_t O_{t+1} \cdots O_T | \lambda]}{P[O | \lambda]} \\ &= \frac{P[O_{t+1} \cdots O_T | Q_t = S_i, O_1 \cdots O_t, \lambda] \cdot P[Q_t = S_i, O_1 \cdots O_t | \lambda]}{P[O | \lambda]} \\ &= \frac{P[O_{t+1} \cdots O_T | Q_t = S_i, \lambda] \cdot P[Q_t = S_i, O_1 \cdots O_t | \lambda]}{P[O | \lambda]} \\ &= \frac{\beta_t(i) \cdot \alpha_t(i)}{P[O | \lambda]} = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{j=1}^{N} P[Q_t = S_j, O | \lambda]} = \frac{\alpha_t(i) \cdot \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \cdot \beta_t(j)} \\ \end{aligned} $$

where:

$$ \begin{aligned} &P[O_1 \cdots O_{t+1}, Q_{t+1} = S_j | \lambda] \\ &= \sum_{i=1}^{N} P[O_1 \cdots O_t O_{t+1}, Q_t = S_i, Q_{t+1} = S_j | \lambda] \\ &= \sum_{i=1}^{N} P[O_{t+1}, Q_{t+1} = S_j | O_1 \cdots O_t, Q_t = S_i, \lambda] \cdot P[O_1 \cdots O_t, Q_t = S_i | \lambda] \\ &= \sum_{i=1}^{N} P[O_{t+1}, Q_{t+1} = S_j | Q_t = S_i, \lambda] \cdot P[O_1 \cdots O_t, Q_t = S_i | \lambda] \\ &= \sum_{i=1}^{N} P[O_{t+1} | Q_{t+1} = S_j, Q_t = S_i, \lambda] \cdot P[Q_{t+1} = S_j | Q_t = S_i, \lambda] \cdot \alpha_{t}(i) \\ &= \sum_{i=1}^{N} P[O_{t+1} | Q_{t+1} = S_j, \lambda] \cdot P[Q_{t+1} = S_j | Q_t = S_i, \lambda] \cdot \alpha_{t}(i) \\ &= \sum_{i=1}^{N} b_{j}(O_{t+1}) \cdot a_{ij} \cdot \alpha_{t}(i) \\ &= b_{j}(O_{t+1}) \cdot \sum_{i=1}^{N} a_{ij} \cdot \alpha_{t}(i) \\ &= \alpha_{t+1}(j) \end{aligned} $$

and:

$$ \begin{aligned} &P[O_{t+1} \cdots O_T | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} P[O_{t+1} O_{t+2} \cdots O_T, Q_{t+1} = S_j | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} P[O_{t+2} \cdots O_T | O_{t+1}, Q_{t+1} = S_j, Q_{t} = S_i] \cdot P[O_{t+1}, Q_{t+1} = S_j | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} P[O_{t+2} \cdots O_T | Q_{t+1} = S_j] \cdot P[O_{t+1} | Q_{t+1} = S_j, Q_{t} = S_i] \cdot P[Q_{t+1} = S_j | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} \beta_{t+1}(j) \cdot P[O_{t+1} | Q_{t+1} = S_j] \cdot P[Q_{t+1} = S_j | Q_{t} = S_i] \\ &= \sum_{j=1}^{N} \beta_{t+1}(j) \cdot b_{j}(O_{t+1}) \cdot a_{ij} \\ &= \beta_{t}(i) \end{aligned} $$

Thus the former probability ($\alpha_t$) is computed forward in time, while the latter ($\beta_t$) is computed backward in time.

Note that $ \gamma_t(i) $ is also called the smoothed value. This process of using both $ \alpha_t(i) $ and $ \beta_t(i) $ is called smoothing.
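A sketch of the smoothing computation, assuming the `forward` and `backward` sketches above are in scope: it normalizes $\alpha_t(i) \cdot \beta_t(i)$ over $i$ to obtain $\gamma_t(i)$ and takes the argmax at each step.

```python
import numpy as np

def smooth(O, A, B, pi):
    """gamma[t, i] = P[Q_{t+1} = S_i | O, lambda]; also the most likely state per step."""
    alpha, _ = forward(O, A, B, pi)            # forward sketch above
    beta = backward(O, A, B)                   # backward sketch above
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # denominator: sum_j alpha_t(j) * beta_t(j)
    return gamma, gamma.argmax(axis=1)         # smoothed values and argmax_i gamma_t(i)
```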

Problem 3: Determine Model Parameters

There is no known analytical way to choose the model parameters that globally maximizes $P[O|\lambda]$. Instead, one uses an iterative procedure that converges to a local maximum (the Baum-Welch / EM algorithm) or gradient techniques.

Iterative Procedure:

Define:

$$ \begin{aligned} \xi_{t}(i, j) &= P[Q_{t+1} = S_j, Q_{t} = S_i | O, \lambda] \\ &= \frac{P[Q_{t+1} = S_j, Q_{t} = S_i , O_1\cdots O_{T} | \lambda]}{P[O|\lambda]} \\ &= \frac{P[O_{t+2}\cdots O_T | Q_{t+1} = S_j, \lambda] \cdot P[O_{t+1} | Q_{t+1} = S_j, \lambda ] \cdot P[Q_{t+1} = S_j| Q_{t} = S_i, \lambda] \cdot P[Q_t = S_i, O_1\cdots O_{t} | \lambda]}{P[O|\lambda]} \\ &= \frac{\beta_{t+1}(j) \cdot b_{j}(O_{t+1}) \cdot a_{ij} \cdot \alpha_{t}(i)}{P[O|\lambda]} \\ &= \frac{\beta_{t+1}(j) \cdot b_{j}(O_{t+1}) \cdot a_{ij} \cdot \alpha_{t}(i)}{\sum_{j=1}^{N} \sum_{i=1}^{N} \beta_{t+1}(j) \cdot b_{j}(O_{t+1}) \cdot a_{ij} \cdot \alpha_{t}(i)} \\ \end{aligned} $$

Then:

$$ \begin{aligned} \gamma_t(i) &= P[Q_t = S_i | O, \lambda] \\ &= \sum_{j=1}^{N} P[Q_{t+1} = S_j, Q_t = S_i | O, \lambda] \\ &= \sum_{j=1}^{N} \xi_{t}(i, j) \end{aligned} $$

Statistical Meaning:

$$ \begin{aligned} \sum_{t=1}^{T - 1} \gamma_{t}(i) &= \mathbb{E}[\text{\#Transitions from $S_i$}] \\ \sum_{t=1}^{T - 1} \xi_{t}(i, j) &= \mathbb{E}[\text{\#Transitions from $S_i$ to $S_j$}] \\ \end{aligned} $$

Re-estimation Formulas:

$$ \begin{aligned} \overline{\pi_i} &= \gamma_{1}(i) \\ \overline{a}_{ij} &= \frac{\mathbb{E}[\text{\#Transitions from $S_i$ to $S_j$}]}{\mathbb{E}[\text{\#Transitions from $S_i$}]} = \frac{\sum_{t=1}^{T - 1} \xi_{t}(i, j)}{\sum_{t=1}^{T - 1} \gamma_{t}(i)} \\ \overline{b}_j(k) &= \frac{\mathbb{E}[\text{\#In $S_j$, observing $V_k$}]}{\mathbb{E}[\text{\#In $S_j$}]} = \frac{\sum_{t=1}^{T}1_{[O_t = V_k]} \cdot \gamma_{t}(j)}{\sum_{t=1}^{T} \gamma_{t}(j)} \end{aligned} $$

Procedure:

  1. Use the current $ \lambda $ and the re-estimation formulas to compute $ \overline{\lambda} $.
  2. If $ P[O | \overline{\lambda}] > P[O | \lambda] $, replace $ \lambda $ with $ \overline{\lambda} $ and repeat; otherwise stop, since a local maximum has been reached (see the sketch after the constraints below).

Stochastic Constraints:

$$ \begin{aligned} \sum_{i=1}^{N} \overline{\pi}_i &= 1 \\ \sum_{j=1}^{N} \overline{a}_{ij} &= 1 \\ \sum_{k=1}^{M} \overline{b}_j(k) &= 1 \end{aligned} $$
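Putting the pieces together, below is a sketch of a single re-estimation step, again assuming the `forward` and `backward` sketches above are in scope; the vectorized $\xi$ computation and the function name `reestimate` are illustrative, not a fixed recipe.

```python
import numpy as np

def reestimate(O, A, B, pi):
    """One Baum-Welch update: returns (A_bar, B_bar, pi_bar) from the current model."""
    O = np.asarray(O)                          # observations as integer symbol indices
    N, M = A.shape[0], B.shape[1]
    alpha, _ = forward(O, A, B, pi)            # forward sketch above
    beta = backward(O, A, B)                   # backward sketch above

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # gamma[t, i] = P[Q_{t+1} = S_i | O, lambda]

    # xi[t, i, j] ~ alpha_t(i) * a_ij * b_j(O_{t+1}) * beta_{t+1}(j), for t = 1 .. T-1
    xi = alpha[:-1, :, None] * A[None] * (B[:, O[1:]].T * beta[1:])[:, None, :]
    xi /= xi.sum(axis=(1, 2), keepdims=True)

    pi_bar = gamma[0]
    A_bar = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_bar = np.stack([gamma[O == k].sum(axis=0) for k in range(M)], axis=1)
    B_bar /= gamma.sum(axis=0)[:, None]
    return A_bar, B_bar, pi_bar
```

Iterating this step while $P[O|\overline{\lambda}]$ keeps improving implements the procedure above; each row of $\overline{A}$ and $\overline{B}$, and $\overline{\pi}$ itself, sums to one by construction, so the stochastic constraints are preserved.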
